Airbnb is now a popular choice for renting a place to stay while traveling. New York is a famous destination that attracts millions of tourists every year. In this EDAV project, we use Airbnb’s data for New York City (NYC) to look for interesting patterns in its listings. We are mainly interested in the following questions:
What factors affect the house price? Which variables are correlated with each other?
What is the trend in housing demand?
What are people’s opinions of Airbnb listings?
P.S. GitHub repo: https://github.com/alexliyihao/EDAV_final_project.git
We used the Airbnb data published at http://insideairbnb.com/get-the-data.html. For NYC, we used the following data:
listings.csv: Summary information and metrics for listings in New York City
listings.csv.gz: Detailed listings data for New York City
reviews.csv: Summary Review data and Listing ID
reviews.csv.gz: Detailed review data for listings in New York City
The listings.csv file contains 48377 room listings. Typical variables include:
price: the price for the room per night.
neighborhood_group: the room’s location area, including Bronx, Brooklyn, Manhattan, Queens and Staten Island.
reviews_per_month: The number of reviews that a listing has received per month.
reviews.csv contains the comments on those Airbnb listings. It has 6 variables: listing ID, comment ID, date, reviewer’s ID, reviewer’s name and comments. The dataset is about 400 MB in size and has only 19 missing values. The comments are written in various languages, including Japanese, Spanish and Chinese. The comment dates range from 2009 to 2019.
Airbnb provides a large dataset with a great number of variables.
For each listing, it provides information ranging from the users’ URLs to the amenity tags to the position (longitude and latitude).
amenities (the list of amenity string tags) is not converted into a string, because it will be used with a specific API that performs better with the factor type.
All money values in the “$x,xxx.xx” factor format are converted into clean numeric form.
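The money conversion can be sketched in base R as follows. This is a minimal illustration rather than the project's actual code: `clean_money` is a hypothetical helper name and the sample values are made up.

```r
# Convert "$x,xxx.xx" money values (stored as factors or strings) to numeric.
# clean_money is a hypothetical helper name, not taken from the project code.
clean_money <- function(x) {
  # as.character() comes first: calling as.numeric() directly on a factor
  # would return the internal level codes, not the amounts.
  as.numeric(gsub("[$,]", "", as.character(x)))
}

# Made-up sample values in the factor format described above.
price <- factor(c("$1,250.00", "$85.00", "$0.00", NA))
clean_money(price)
```

Missing values stay `NA`, so the missing-value patterns are unaffected by the conversion.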
The original dataset is unwieldy (not in file size, but its strings work horribly with R), so we wrote out a lighter version for further study. All the following analysis is based on this cleaner dataset.
We first divided reviews.csv by year. To prepare for the word cloud, we applied text mining to each year’s data, converted the resulting words into matrices, and saved them as comments_20xx.RData. Since there are too many comments in 2018 and 2019, we took only 100,000 samples from each of those years.
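A rough sketch of the per-year split and the sampling cap is shown below. The data frame is made up, the column names are assumed to match reviews.csv, and `max_n` stands in for the 100,000 cap.

```r
set.seed(1)  # only to make the sketch reproducible

# Made-up stand-in for reviews.csv.
reviews <- data.frame(
  listing_id = c(1, 1, 2, 3, 3, 3, 4),
  date = as.Date(c("2017-05-01", "2018-01-10", "2018-03-02",
                   "2018-07-15", "2019-02-01", "2019-02-20", "2019-06-30")),
  comments = c("great place", "nice", "quiet", "clean", "great", "ok", "safe"),
  stringsAsFactors = FALSE
)

# Split the reviews by calendar year.
reviews$year <- format(reviews$date, "%Y")
by_year <- split(reviews, reviews$year)

# Cap each year at max_n rows; years below the cap are kept whole.
max_n <- 2  # stands in for 100,000
sampled <- lapply(by_year, function(d) {
  if (nrow(d) > max_n) d[sample(nrow(d), max_n), , drop = FALSE] else d
})
sapply(sampled, nrow)
```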
For the time series analysis, we created several time-related variables from reviews.csv, which contains the users’ reviews of each room and their dates. First, for every room, we treat the date of its first review as the date the room ‘joined’ Airbnb. We can then count how many rooms joined Airbnb over time. Second, we treat the number of rooms reviewed every month as the demand. We can also count this number by borough, room type, etc.
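The two derived variables can be sketched in base R as follows. The review records are made up, and one interpretation is assumed: demand counts distinct listings reviewed in a month, not the raw number of reviews.

```r
# Made-up review records; listing_id and date are assumed column names.
reviews <- data.frame(
  listing_id = c(10, 10, 11, 11, 12, 10),
  date = as.Date(c("2018-01-05", "2018-02-11", "2018-02-03",
                   "2018-02-20", "2018-03-09", "2018-03-15"))
)

# First review per listing, treated as the date the room "joined" Airbnb.
by_date <- reviews[order(reviews$date), ]
first_review <- by_date[!duplicated(by_date$listing_id), ]

# Rooms joining per month, and the running total over time.
new_rooms <- table(format(first_review$date, "%Y-%m"))
total_rooms <- cumsum(new_rooms)

# Demand: number of distinct rooms reviewed in each month (assumed reading).
reviews$month <- format(reviews$date, "%Y-%m")
room_month <- unique(reviews[, c("listing_id", "month")])
demand <- table(room_month$month)
demand
```

Grouping `room_month` by borough or room type before tabulating would give the per-group demand series.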
Since most of the information is needed in factor format, we use read.csv rather than read_csv here. As an ID, host_id should be a factor rather than numeric, but its factor format is lost whenever we re-read the csv, so we re-add it manually every time.
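A small sketch of this reading step is shown below, using inline made-up data instead of the real file. Note that `read.csv` only converts strings to factors by default before R 4.0.0; from 4.0.0 on, `stringsAsFactors = TRUE` has to be passed explicitly.

```r
# Inline made-up data standing in for the lighter csv file.
csv_text <- "host_id,room_type,price
2787,Private room,149
2845,Entire home/apt,225
2787,Private room,89"

# Text columns become factors only when stringsAsFactors = TRUE
# (the default before R 4.0.0, explicit since then).
listings <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# host_id is parsed as an integer, so it is converted to a factor by hand
# after every read.
listings$host_id <- factor(listings$host_id)

str(listings)
```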
Most rows have no missing values.
Missing values mainly appear in the following three groups of fields:
The house’s hardware: bathrooms, bedrooms, beds (rare)
Pricing: security_deposit and cleaning_fee (very common)
Reviews: review_scores_rating and reviews_per_month (common)
Common missing patterns:
Missing all review-related fields (review_scores_rating and reviews_per_month): the listing may have had no visitors yet.
Missing security_deposit only: many hosts do not charge a security deposit.
Missing all pricing-related fields (security_deposit and cleaning_fee): same idea.
Missing all review-related and pricing-related fields: the intersection of the hosts mentioned above.
Thus the missing patterns largely follow common sense; few of them are hard to explain.
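One way to tabulate such patterns in base R is sketched below. The data frame is made up and only the column names come from the text; the project may have used a dedicated plotting package instead.

```r
# Made-up rows; column names follow the fields discussed above.
listings <- data.frame(
  bedrooms = c(1, 2, NA, 1, 1),
  security_deposit = c(100, NA, 0, NA, NA),
  cleaning_fee = c(50, 30, 20, NA, NA),
  review_scores_rating = c(95, 90, 88, 97, NA),
  reviews_per_month = c(1.2, 0.5, 2.0, 0.1, NA)
)

# One label per row naming the columns that are missing ("complete" if none).
na_mat <- is.na(listings)
pattern <- apply(na_mat, 1, function(r) {
  if (any(r)) paste(colnames(na_mat)[r], collapse = " + ") else "complete"
})

# Frequency of each missing pattern, most common first.
sort(table(pattern), decreasing = TRUE)
```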
In this section, we concentrate on the relationships between the numeric variables in listings.csv.
First, we found that the numeric variables contain several outliers.
These four variables contain plenty of outliers, so we omitted the outliers in order to check the patterns. For review_scores_rating, most review scores are above 80, but we kept the lower scores because we still want to include them. We also included availability_365 and room_type as variables in the plot. We drew this parallel coordinate plot clustered by boroughs.
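The text does not say which outlier rule was used. As one plausible sketch, the usual boxplot rule (keep values within 1.5 IQR of the quartiles) can be applied to several columns at once. `within_whiskers` is a hypothetical helper and the data are made up.

```r
# TRUE for values inside [Q1 - k*IQR, Q3 + k*IQR]; NA counts as outside.
# within_whiskers is a hypothetical helper; the 1.5*IQR rule is an assumption.
within_whiskers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  !is.na(x) & x >= q[1] - k * iqr & x <= q[2] + k * iqr
}

# Made-up listings with one extreme price and one extreme cleaning fee.
listings <- data.frame(
  price = c(40, 55, 60, 75, 90, 120, 150, 5000),
  cleaning_fee = c(10, 20, 25, 30, 30, 40, 900, 50)
)

# Keep a row only if every selected column is inside its whiskers.
cols <- c("price", "cleaning_fee")
keep <- Reduce(`&`, lapply(listings[cols], within_whiskers))
listings[keep, ]
```

Leaving a column such as review_scores_rating out of `cols` keeps its low values in the plot.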
Clustered by boroughs, we can see that the price ranges of the Bronx and Queens are similar (approximately 25 to 200). Manhattan has the largest price range, from 50 up to 300, while Staten Island has the smallest, from 50 up to 100, and also contains the fewest Airbnb listings.
Although most of the review scores are in the range (80, 100), the Bronx has the largest share of review scores above 90, and these rooms are priced at around 50 to 100. If we select only private rooms, we can see that most private rooms in the Bronx are priced at around 50 with a score above 90. Many of these private rooms are available more than 300 days a year. The Bronx has a lower cleaning fee for private rooms (below 50) than the other boroughs, where the cleaning fee is below 100. Similarly, the security deposit for private rooms is lower in the Bronx. Moreover, the review scores of entire homes or apartments in the Bronx are mostly above 95. The Bronx contains only a few hotel rooms and shared rooms.
Queens shows similar patterns. Private rooms priced at around 50 have a low security deposit and cleaning fee and a review score above 90. In addition, entire homes or apartments also have review scores above 90 and are priced higher than both private rooms in Queens and entire homes or apartments in the Bronx. Queens also contains only a few hotel rooms and shared rooms. Overall, the Bronx and Queens have several similarities, possibly because the residential environments of the two boroughs, including their room types, are similar.
Things are different in the other three boroughs. The price ranges of Brooklyn and Manhattan are similar (50 to 300). However, private rooms are more expensive in Manhattan (50 to 150) than in Brooklyn (50 to 100). The price of an entire home or apartment is also a little higher in Manhattan than in Brooklyn. Most hotel rooms are in Manhattan and Brooklyn. Staten Island has the fewest Airbnb listings, and its prices are around 100.
Ignoring the clustering by boroughs, we can see that many entire homes or apartments are available more than 300 days a year, while many private rooms are available fewer than 100 days a year. Private rooms accommodate fewer guests, while entire homes or apartments accommodate more. Most hotel rooms require neither a security deposit nor a cleaning fee, and many of them accommodate 2 guests. Shared rooms accommodate 1 or 2 guests. Shared rooms are cheap, and most of them require no security deposit and have a low cleaning fee.
Download the geojson map of NYC, then read both the geojson map and the dataset.
On the price side, for general intuition, check the density:
This gives extremely right-skewed data. To get a practical scale, calculate the upper hinge:
## [1] "upper_hinge = 334"
An upper_hinge of 334 looks relatively small with respect to the density curve. Let’s check.
Check the 99th percentile:
## 99%
## 800
For a safe and relatively small scale, let’s split the distribution into the following tags: lower than the upper hinge of 334, 334-800 (up to the 99th percentile), and greater than 800 (very expensive, which can be considered outliers).
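The three price tags can be sketched with `quantile` and `cut`. The prices below are made up. The sketch also assumes that the reported “upper hinge” is the upper boxplot fence, Q3 + 1.5 × IQR; the source does not show how the value 334 was computed.

```r
# Made-up prices; the real analysis uses the price column of the listings.
price <- c(45, 60, 69, 80, 106, 120, 175, 250, 400, 950)

# Assumption: "upper hinge" means the upper boxplot fence Q3 + 1.5 * IQR.
q <- quantile(price, c(0.25, 0.75), names = FALSE)
upper_fence <- q[2] + 1.5 * (q[2] - q[1])

# 99th percentile of the price.
p99 <- quantile(price, 0.99, names = FALSE)

# Bin prices with the two thresholds reported in the text (334 and 800).
price_tag <- cut(price,
                 breaks = c(-Inf, 334, 800, Inf),
                 labels = c("<= 334", "334-800", "> 800"))
table(price_tag)
```

With `cut`'s default `right = TRUE`, a price of exactly 334 or 800 falls into the lower of the two adjacent tags.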
Try interactive (d) at https://alexlyh.shinyapps.io/interactive_price/ to plot the spatial distribution of housing with different values. Notice that: 1. prices over 800 may not make a large difference; 2. too large a price interval, especially on the higher side, will ruin the color scale.
Some conclusions after multiple tries:
Manhattan and West Brooklyn have most of the expensive housing, no matter what scale is selected, i.e., the price distributions in these places are more skewed to the cheaper side.
The Bronx and Queens generally provide cheap housing. (You can easily verify this by setting the minimum price above 200, which filters out almost every listing in the Bronx and Queens.)
2.2.1 Macro aspect (in aggregate supply)
First, as all of the data analysis on NYC, let’s divide the
For spatial intuition, plot the hexagon heatmap of supply density on the NYC map:
The result agrees with both the histogram and common sense: Brooklyn and Manhattan provide the densest supply.
2.2.2 Micro aspect (in individual supply)
Here we use calculated_host_listings_count, the number of listings offered by a host.
To our surprise, the share of hosts with more than one listing is not small at all. At least 30% of the hosts offer 2 or more.
## [1] "The 70 percentile of calculated_host_listings_count is: 2"
There are even some “mega-hosts” in NYC, which can be checked with interactive (c).
Let’s intuitively define a large host as one who provides 5 or more listings, and who is more likely to be a commercial host than one renting for pocket money.
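The percentile check and the large-host flag can be sketched as follows. The counts are made up and only the threshold of 5 comes from the text.

```r
# Made-up values of calculated_host_listings_count, one per listing.
host_count <- c(1, 1, 1, 1, 1, 1, 2, 2, 5, 12)

# 70th percentile of the listings count.
p70 <- quantile(host_count, 0.70, names = FALSE)

# Share of listings whose host offers 2 or more listings.
share_multi <- mean(host_count >= 2)

# Flag large hosts with the threshold of 5 used in the text.
host_size <- ifelse(host_count >= 5, "large", "small")
table(host_size)
```

Because `calculated_host_listings_count` is repeated on every listing of a host, these shares are per listing; deduplicating by `host_id` first would give per-host shares.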
Generally, large hosts and small hosts have the same price distribution. The difference is that large hosts offer more listings in the 0-50 (very cheap) and 200-350 (slightly more expensive) ranges.
We ran a time series analysis on the number of new rooms in NYC and on monthly demand. Our findings follow.
3.1 Trend of Room Supply
For the total number of rooms, the trends in Manhattan and Brooklyn are quite similar: their growth rates became high after 2015. In contrast, the number of rooms in Queens started to increase at a relatively high rate after 2017, which suggests that demand in this area may keep growing.
3.2.1 Aggregate trend in NYC
For monthly room demand, we found that more and more people stay in Airbnb rooms when traveling to NYC. More interestingly, the demand shows some seasonality, so we can use decomposition to detect the seasonal effects.
3.2.2 Decomposed Trend in NYC
First, the trend curve in the decomposition suggests that the demand for Airbnb kept increasing over those years. Moreover, the demand time series has a clear seasonality. In winter, especially in February, the demand for Airbnb in NYC is extremely low relative to other months. At this time, there might be fewer travelers to NYC. After March, the demand increases, and it is highest in August and September. A possible reason is that more tourists visit NYC in summer than in winter. (NYC is extremely cold in winter! This could be compared with data from southern cities.)
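As an illustration of the decomposition step, the sketch below applies `decompose` from base R's stats package to a synthetic monthly series. The text does not say which function was used (`stl` would be another option), and the series is constructed, not the project's data: it has a rising trend, a summer peak and a February trough built in.

```r
set.seed(42)  # reproducible noise

# Synthetic monthly demand over six years: linear trend plus a seasonal
# wave that peaks in August and bottoms out in February, plus small noise.
months <- 1:72
trend <- 100 + 5 * months
seasonal <- 40 * sin(2 * pi * (months - 5) / 12)
demand <- ts(trend + seasonal + rnorm(72, sd = 1),
             start = c(2014, 1), frequency = 12)

# Classical additive decomposition into trend, seasonal and random parts.
parts <- decompose(demand, type = "additive")

# parts$figure holds the 12 estimated monthly effects, January to December.
round(parts$figure, 1)
which.max(parts$figure)  # month with the highest seasonal effect
which.min(parts$figure)  # month with the lowest seasonal effect
```

`plot(parts)` draws the observed, trend, seasonal and random panels in one figure.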
Most of those cities show the same pattern as NYC, with low demand in winter and high demand in summer. This even holds for some southern cities, such as San Francisco and Los Angeles. This might indicate that people are more likely to travel in summer than in winter.
However, some southern cities, such as New Orleans and Mexico City, are quite popular in winter, and their demand in summer is relatively lower.
Also, the seasonal factor in Tokyo suggests that most people go to Tokyo in May. The reason might be that people go to see sakura, a symbol of Japan, although sakura in Tokyo usually blooms around late March to early April, so this explanation is tentative.
Returning to our NYC data, we look at the demand in more detail, broken down by area and by room type.
The plot shows strong co-movement among Brooklyn, Manhattan, and Queens. By room type, the demand for entire homes and for private rooms moves in a similar way, while the demand for the other two room types is small.
(a). Amenities https://edavfinalproject.shinyapps.io/wordcloud_amenities/
We used a word cloud in a Shiny app to display the most listed amenities. With the minimum frequency set to 10,185 and the maximum number of words set to 61, the 6 most mentioned words are: detector, kitchen, conditioning, heating, essentials and air. “air” and “conditioning” presumably belong together as “air conditioning”, but they are separated because we split the word groups on whitespace in the text mining process. “monoxide” and “carbon” also appear many times, because New York law requires households to have at least one carbon-monoxide detector in each room. “Extinguisher” also appears many times, which suggests that Airbnb hosts value the safety of their guests’ lives and property. One interesting finding is that Airbnb in New York seems to serve business travelers, because words like “workspace” and “laptop” appear a significant number of times, which might not be common in other cities. The rest of the words are basically living essentials that can be seen in most Airbnb listings.
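The whitespace effect described above can be reproduced with a few lines of base R. The amenity strings are made up and the real text mining pipeline may differ; the sketch only contrasts word-level with phrase-level counting.

```r
# Made-up amenity strings, one per listing.
amenities <- c(
  "Air conditioning, Kitchen, Smoke detector, Carbon monoxide detector",
  "Kitchen, Heating, Air conditioning, Laptop friendly workspace"
)

# Word level: lower-case, replace punctuation with spaces, split on whitespace.
clean <- tolower(gsub("[[:punct:]]", " ", amenities))
words <- unlist(strsplit(trimws(clean), "\\s+"))
word_freq <- sort(table(words), decreasing = TRUE)
head(word_freq)

# Phrase level: split on commas only, so "air conditioning" stays whole.
phrases <- trimws(tolower(unlist(strsplit(amenities, ","))))
phrase_freq <- sort(table(phrases), decreasing = TRUE)
head(phrase_freq)
```

At word level, “air”, “conditioning” and “detector” are counted separately, which matches what the word cloud shows; at phrase level the two detector types stay distinct.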
(b). Reviews https://edavfinalproject.shinyapps.io/wordcloud_reviews/
Since the number of comments exploded in recent years, the highlighted words appear with different frequencies across years. In 2019, the two most common words are “great” and “place”, which are probably the simplest praise of a place. “Recommend” also has a relatively high frequency, which suggests that people who comment on Airbnb mostly want to recommend the place to potential guests. However, since the number of comments increased significantly in the past years, we also wonder whether those comments were actually written by “real people” rather than by the hosts themselves. Words like “walking”, “restaurants”, “park”, “neighborhood” and “safe” are all about the nearby community, which means that people care not only about the environment inside the rooms but also about the surrounding area. “quiet” is also mentioned many times, and quiet is probably the most precious thing in NYC. What surprised us is that “kitchen” is also mentioned many times, since we thought that people who come to NYC can easily reach a lot of excellent restaurants.
2018 shows basically the same words as 2019. However, in 2017, the word “shops” is mentioned a lot. We are curious why people stopped mentioning this word in 2018 and 2019. Did people’s purchasing power decrease in these years?
(c). Big Host https://alexlyh.shinyapps.io/interactive_large_host/ Used in the exploratory analysis of the distribution of big hosts and their behaviour.
(d). Spatial distribution of price https://alexlyh.shinyapps.io/interactive_price/ Used in the exploratory analysis of the spatial distribution of price.
The number of features and the amount of data provided by Airbnb are limited. We could include more variables, such as the crime rate in specific areas. We might also include other data, such as hotel prices.
We can do further analysis on cases with extreme prices and low rating scores.
Based on the occupancy rate of a house, we could provide recommendations for a host, such as lowering the price. We could also suggest to guests which house to rent based on their preferences.
For the spatial analysis, the key limitation is that we did not find geojson data with borough names, which stopped us from plotting a choropleth with the crime rate and joining it into our analysis. After consideration, we used only the most objective data from Airbnb.
For the word cloud, we discussed which method to use for text mining, since we worried that words like “TV” and “Cable TV” are repetitive, while the word “detector” might refer to different things. Also, since the number of comments varies across years, it is hard to use a single fixed-size slider. It would be better if the slider could be adjusted according to the maximum word frequency.